1 Generalized Linear Models

Generalized linear models (GLMs) are a class of linear-based regression models developed to handle error distributions other than the normal (Gaussian) distribution. Common examples include responses that are binary (e.g., 0 and 1) and count data, which take discrete, non-negative integer values. Data can be transformed prior to fitting a model, but it may still be difficult to properly meet the model assumptions. We will show examples of how to perform data transformations as well as how to choose appropriate GLMs based on the properties of the data.

1.1 One-Way ANOVA with non-normal data

We will first analyze data from a trial of different insecticide applications and the resulting insect counts. The four sprays A, B, C, and F are the focus of this analysis and are filtered from the original dataset. We will approach this analysis just as we did in the ANOVA lecture.

Count_Data <- read_csv(here("data", "Count_Model.csv"))
## Rows: 72 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): spray
## dbl (1): count
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Count_Data <- Count_Data %>% 
  filter(spray %in% c("A", "B", "C", "F")) %>%
  droplevels()

Count_Data$spray <- as.factor(Count_Data$spray)

1.1.1 Data Visualization

It is always a good idea to visualize the data prior to the analysis. We are comparing insect counts among the different treatment types, looking for the treatment with the fewest insects observed, which would indicate the most effective treatment.

ggplot(Count_Data, aes(x = spray, y = count)) +
  geom_boxplot(outlier.shape = NA, width = 0.5) +
  geom_jitter(height = 0, width = 0.1, size = 3, pch = 21) +
  labs(x = "Spray Treatment Type", y = "Count") +
  theme_classic() +
  theme(axis.title = element_text(face = "bold", size = 15),
        axis.text = element_text(face = "bold", size = 15))

hist(Count_Data$count)

When we work with certain types of data - count data in this instance - linear models often fail to meet their assumptions. A visual inspection shows that the data are non-negative and not normally distributed. However, the linear model assumption of normality is not about the response variable but about the errors, or residuals, of the model. This histogram can help guide which transformations might help meet the linear model assumptions.

First we will fit a linear model with the non-normal data to show the distribution of the residuals.

lm1 <- lm(count ~ spray, data = Count_Data)
Anova(lm1, type = 2) # Appropriate since there is no interaction effect
Sum Sq Df F value Pr(>F)
spray 1648.729 3 26.47836 0
Residuals 913.250 44 NA NA
hist(resid(lm1)) # distribution should be symmetrical around zero

autoplot(lm1)

plot(resid(lm1) ~ fitted(lm1)) # funnel shapes or curvature are signs of heteroskedasticity or a trend in the residuals
abline(h = 0)

qqPlot(resid(lm1)) # residuals should line up close to the blue line

## [1] 45 46
qqnorm(resid(lm1))
qqline(resid(lm1))

shapiro.test(resid(lm1))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(lm1)
## W = 0.96838, p-value = 0.2188
boxplot(resid(lm1) ~ Count_Data$spray) # variances should be homogeneous for each group

car::leveneTest(resid(lm1) ~ spray, data = Count_Data)
Df F value Pr(>F)
group 3 2.711731 0.0563515
44 NA NA

Based on the boxplot, the variation is not the same across the spray treatments, and the Levene test (p = 0.056) only narrowly meets the homogeneity of variance assumption. Still, this is not the worst example of count data failing the normally-distributed-residuals assumption. We will move forward with the other modeling approaches to illustrate the lesson.

1.2 Log-linear Model

To model the effect of the different sprays on insect counts in a way that more closely meets the normally-distributed-residuals assumption, we can fit a log-linear model by applying the natural log-transformation to the response (count). You need to add some small value to the variable before the transformation because log(0) is undefined. You could add any small constant, but +1 is convenient because zeros map back to zero after the transformation (log(0 + 1) = 0).

lm2 <- lm(log(count + 1) ~ spray, data = Count_Data)
Anova(lm2, type = 2)
Sum Sq Df F value Pr(>F)
spray 29.365095 3 56.68159 0
Residuals 7.598377 44 NA NA
hist(resid(lm2))

autoplot(lm2)

plot(resid(lm2) ~ fitted(lm2))
abline(h = 0)

qqPlot(resid(lm2))

## [1] 27 25
qqnorm(resid(lm2))
qqline(resid(lm2))

shapiro.test(resid(lm2))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(lm2)
## W = 0.97753, p-value = 0.4804
boxplot(resid(lm2) ~ Count_Data$spray)

car::leveneTest(resid(lm2) ~ spray, data = Count_Data)
Df F value Pr(>F)
group 3 3.051575 0.0382753
44 NA NA

The transformation makes the spread of the residuals a little more consistent across treatment types according to the boxplot, but the model now fails the Levene test (p = 0.038) while passing the Shapiro-Wilk test for normally distributed residuals. Another step to complete the lesson.

1.3 Generalized Linear Model (GLM)

A generalized linear model works similarly to the traditional linear model we have been covering, in that we estimate the relationship between the explanatory variables and the response variable. The new elements are the family, which specifies the error distribution (e.g., binomial, Poisson, etc.), and the link function, which relates the mean of the response to the linear predictor. There is specific syntax in the model functions to specify the appropriate family and link based on the type of data analyzed. For this example - we have count data - the Poisson distribution is often what is tested first in a generalized linear model.

Now we will use GLMs to examine the effect of the different sprays on insect counts. The glm() function fits a generalized linear model; you specify the error distribution with the family argument. The default family is gaussian (normal).

glm1 <- glm(count ~ spray, data = Count_Data, family = poisson(link = "log"))
Anova(glm1, type = 2)
LR Chisq Df Pr(>Chisq)
spray 185.8313 3 0
Anova(glm1, type = 2, test.statistic = "F")
Sum Sq Df F value Pr(>F)
spray 185.83131 3 35.83224 0
Residuals 76.06351 44 NA NA

For GLMs, Anova() returns a likelihood-ratio test with a chi-square statistic by default. You can request an F test instead with the test.statistic = "F" argument, as shown above. Both tests tell you the same thing in this scenario - there is significant variation among the different levels of spray.

We can call for the model summaries to compare the coefficient estimates for the different sprays - notice the GLM estimates are on the log scale, not the response scale, due to the log link for a Poisson distribution. We still need to review the model assumptions.

summary(lm1)
## 
## Call:
## lm(formula = count ~ spray, data = Count_Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3333 -2.3750 -0.5833  2.0625  9.3333 
## 
## Coefficients:
##             Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)  14.5000     1.3152  11.025 0.0000000000000302 ***
## sprayB        0.8333     1.8599   0.448              0.656    
## sprayC      -12.4167     1.8599  -6.676 0.0000000341855369 ***
## sprayF        2.1667     1.8599   1.165              0.250    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.556 on 44 degrees of freedom
## Multiple R-squared:  0.6435, Adjusted R-squared:  0.6192 
## F-statistic: 26.48 on 3 and 44 DF,  p-value: 0.0000000006091
summary(lm2)
## 
## Call:
## lm(formula = log(count + 1) ~ spray, data = Count_Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95261 -0.25946  0.01131  0.32350  1.12683 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   2.6967     0.1200  22.480 < 0.0000000000000002 ***
## sprayB        0.0598     0.1696   0.353                0.726    
## sprayC       -1.7441     0.1696 -10.281    0.000000000000283 ***
## sprayF        0.1189     0.1696   0.701                0.487    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4156 on 44 degrees of freedom
## Multiple R-squared:  0.7944, Adjusted R-squared:  0.7804 
## F-statistic: 56.68 on 3 and 44 DF,  p-value: 0.000000000000003701
summary(glm1)
## 
## Call:
## glm(formula = count ~ spray, family = poisson(link = "log"), 
##     data = Count_Data)
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)  2.67415    0.07581  35.274 <0.0000000000000002 ***
## sprayB       0.05588    0.10574   0.528               0.597    
## sprayC      -1.94018    0.21389  -9.071 <0.0000000000000002 ***
## sprayF       0.13926    0.10367   1.343               0.179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 262.154  on 47  degrees of freedom
## Residual deviance:  76.323  on 44  degrees of freedom
## AIC: 273.93
## 
## Number of Fisher Scoring iterations: 5

1.3.1 Assumptions

When fitting a generalized linear model, we assume a distribution of the residuals other than the normal, so we do not need to bother with the Shapiro-Wilk or Levene tests. We do still need to review the diagnostic plots to check the independence of the residuals - look for trends in the residuals.

autoplot(lm1) 

autoplot(lm2) 

autoplot(glm1) 

The residual diagnostics for the generalized linear model look better than those for the one-way ANOVA and the log-transformed count data. This is a better approach for modeling this type of data.

1.3.2 Compare marginal means for different models

We compare the emmeans among the different spray treatments for each of the linear models (un-transformed, log-transformed, and Poisson).

emmeans(lm1, ~ spray)
##  spray emmean   SE df lower.CL upper.CL
##  A      14.50 1.32 44   11.849    17.15
##  B      15.33 1.32 44   12.683    17.98
##  C       2.08 1.32 44   -0.567     4.73
##  F      16.67 1.32 44   14.016    19.32
## 
## Confidence level used: 0.95
emmeans(lm2, ~ spray, type = 'response') # type argument helps to back transform the values on the original scale
##  spray response    SE df lower.CL upper.CL
##  A        13.83 1.780 44    10.65     17.9
##  B        14.75 1.890 44    11.36     19.1
##  C         1.59 0.311 44     1.04      2.3
##  F        15.70 2.000 44    12.12     20.3
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log(mu + 1) scale
emmeans(glm1, ~ spray, type = 'response')
##  spray  rate    SE  df asymp.LCL asymp.UCL
##  A     14.50 1.100 Inf     12.50     16.82
##  B     15.33 1.130 Inf     13.27     17.72
##  C      2.08 0.417 Inf      1.41      3.08
##  F     16.67 1.180 Inf     14.51     19.14
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale

1.3.3 Multiple Comparisons

Familiar code to create the results for the multiple comparisons tests.

emmeans1 <- lm1 %>%
  emmeans(specs = "spray") %>%
  cld(adjust = "none", Letters = letters)
emmeans1
spray emmean SE df lower.CL upper.CL .group
3 C 2.083333 1.315158 44 -0.5671931 4.73386 a
1 A 14.500000 1.315158 44 11.8494735 17.15053 b
2 B 15.333333 1.315158 44 12.6828069 17.98386 b
4 F 16.666667 1.315158 44 14.0161402 19.31719 b
emmeans2 <- lm2 %>%
  emmeans(specs = "spray", type = "response") %>%
  cld(adjust = "none", Letters = letters)
emmeans2
spray response SE df lower.CL upper.CL .group
3 C 1.592459 0.3109964 44 1.035699 2.301491 a
1 A 13.831278 1.7791887 44 10.646095 17.887602 b
2 B 14.745307 1.8888374 44 11.363826 19.051616 b
4 F 15.704220 2.0038704 44 12.116801 20.272789 b
emmeans3 <- glm1 %>%
  emmeans(specs = "spray", type = "response") %>%
  cld(Letters = letters)
emmeans3
spray rate SE df asymp.LCL asymp.UCL .group
3 C 2.083333 0.4166664 Inf 1.407727 3.083181 a
1 A 14.500000 1.0992422 Inf 12.497944 16.822767 b
2 B 15.333333 1.1303883 Inf 13.270435 17.716910 b
4 F 16.666667 1.1785113 Inf 14.509743 19.144225 b
emmeans3_diff <- glm1 %>%
  emmeans(specs = "spray", type = "response") %>%
  cld(adjust = "none", Letters = letters, details = T)
emmeans3_diff
## $emmeans
##  spray  rate    SE  df asymp.LCL asymp.UCL .group
##  C      2.08 0.417 Inf      1.41      3.08  a    
##  A     14.50 1.100 Inf     12.50     16.82   b   
##  B     15.33 1.130 Inf     13.27     17.72   b   
##  F     16.67 1.180 Inf     14.51     19.14   b   
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the log scale 
## Tests are performed on the log scale 
## significance level used: alpha = 0.05 
## NOTE: If two or more means share the same grouping symbol,
##       then we cannot show them to be different.
##       But we also did not show them to be the same. 
## 
## $comparisons
##  contrast ratio    SE  df null z.ratio p.value
##  A / C     6.96 1.490 Inf    1   9.071  <.0001
##  B / C     7.36 1.570 Inf    1   9.364  <.0001
##  B / A     1.06 0.112 Inf    1   0.528  0.5972
##  F / C     8.00 1.700 Inf    1   9.803  <.0001
##  F / A     1.15 0.119 Inf    1   1.343  0.1792
##  F / B     1.09 0.111 Inf    1   0.816  0.4144
## 
## Tests are performed on the log scale

If you want to know the exact contrast between the average counts for the different treatment levels, you can call for the details in the cld() function. Because the tests are performed on the log scale, the comparisons are ratios rather than differences: the F treatment averaged 8 times as many insects as the C treatment.

1.3.4 Data Visualization

emmeans1 <- emmeans1 %>%
  as_tibble()

emmeans3 <- emmeans3 %>%
  as_tibble()

Plot_lm1 <- ggplot() +
  geom_point(data = Count_Data, aes(y = count, x = spray), position = position_jitter(width = 0.1)) + # dots representing the raw data
  geom_boxplot(data = Count_Data, aes(y = count, x = spray), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
  geom_point(data = emmeans1, aes(y = emmean, x = spray), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
  geom_errorbar(data = emmeans1, aes(ymin = lower.CL, ymax = upper.CL, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
  geom_text(data = emmeans1, aes(y = emmean, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # group letters 
  labs(y = "Counts", x = "Treatment") +
  theme_classic()
Plot_lm1

Plot_glm1 <- ggplot() +
  geom_point(data = Count_Data, aes(y = count, x = spray), position = position_jitter(width = 0.1)) + # dots representing the raw data
  geom_boxplot(data = Count_Data, aes(y = count, x = spray), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
  geom_point(data = emmeans3, aes(y = rate, x = spray), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
  geom_errorbar(data = emmeans3, aes(ymin = asymp.LCL, ymax = asymp.UCL, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
  geom_text(data = emmeans3, aes(y = rate, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # group letters 
  labs(y = "Counts", x = "Treatment") +
  theme_classic()
Plot_glm1

We can interpret the results from the analysis similarly between the linear model and the generalized linear model using the Poisson distribution. The main difference that we can see in the data visualization is the size of the confidence intervals showing the mean separation between the different treatments.

We can see that the C spray treatment had the fewest insects compared to each other treatment - roughly 7 to 8 times fewer, based on the pairwise ratios.

1.4 Overdispersion

The Poisson distribution is a unique distribution that is often used to model count data. There are specific conditions that describe the Poisson distribution that don't always hold for count data - in particular, it assumes the mean and variance are equal. When the variance is larger than the mean, the data are overdispersed, and a negative binomial distribution often fits better. We have to determine whether the variance exceeds the mean by enough to justify fitting the model with a negative binomial distribution instead of a Poisson. We will review the steps to assess these characteristics of count data.

The general approach is to fit a model with the Poisson distribution, check for overdispersion, and then refit with the negative binomial distribution if overdispersion is detected.
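Before fitting anything, a quick heuristic is to compare each group's sample mean and variance, since under a Poisson model they should be roughly equal. A minimal sketch, assuming the Count_Data object from the earlier example:

```r
library(dplyr)

# Under a Poisson model, each group's mean and variance should be roughly
# equal; a variance well above the mean hints at overdispersion.
Count_Data %>%
  group_by(spray) %>%
  summarise(mean = mean(count), variance = var(count))
```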

1.4.1 Review the Data

data <- read_csv(here("data", "Overdispersion.csv"))
## Rows: 1300 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): block, trt, spray, lead
## dbl (3): row, col, y
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
row col y block trt spray lead
1 1 1 B1 T1 N N
2 1 0 B1 T1 N N
3 1 1 B1 T1 N N
4 1 3 B1 T1 N N
5 1 6 B1 T1 N N
6 1 0 B2 T1 N N

1.4.2 Examine the Data

ggplot(data, aes(x = spray, y = y, fill = lead)) +
  geom_violin(scale = "width", adjust = 2) + # produces the violin geometry that shows the distribution of the data
  geom_point(position = position_jitterdodge(jitter.width = 0.5, # jitters the horizontal positioning of the data
                                             jitter.height = 0.1, # jitters the vertical positioning of the data - not always advised
                                             dodge.width = 1), # shifts the horizontal grouping of the data
             alpha = 0.3) +
  theme_classic(base_size = 14) # changes the size of text over all the plot

1.4.3 Fitting the Model

mod1 <- glm(y ~ spray * lead, data = data, family = "poisson")

summary(mod1)
## 
## Call:
## glm(formula = y ~ spray * lead, family = "poisson", data = data)
## 
## Coefficients:
##              Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)   0.33647    0.04688   7.177    0.000000000000712 ***
## sprayY       -1.02043    0.09108 -11.204 < 0.0000000000000002 ***
## leadY        -0.49628    0.07621  -6.512    0.000000000074139 ***
## sprayY:leadY  0.29425    0.13917   2.114               0.0345 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1955.9  on 1299  degrees of freedom
## Residual deviance: 1720.4  on 1296  degrees of freedom
## AIC: 3125.5
## 
## Number of Fisher Scoring iterations: 6
Anova(mod1, type = 3, test.statistic = "F") # type 3 sums of squares needed for interaction
Sum Sq Df F values Pr(>F)
spray 142.348722 1 105.076131 0.0000000
lead 43.721150 1 32.273203 0.0000000
spray:lead 4.452271 1 3.286489 0.0700835
Residuals 1755.716946 1296 NA NA

1.4.4 Check for Overdispersion

check_overdispersion(mod1)
## # Overdispersion test
## 
##        dispersion ratio =    1.355
##   Pearson's Chi-Squared = 1755.717
##                 p-value =  < 0.001
## Overdispersion detected.

A significant result (p < 0.05) indicates evidence of overdispersion.
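The dispersion ratio reported above can also be computed by hand as the Pearson chi-square statistic divided by the residual degrees of freedom. A minimal sketch, assuming mod1 from the code above:

```r
# Sum of squared Pearson residuals divided by residual df;
# values well above 1 indicate overdispersion.
# This reproduces the dispersion ratio of 1.355 reported above.
pearson_chisq <- sum(residuals(mod1, type = "pearson")^2)
pearson_chisq / df.residual(mod1)
```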

1.4.5 Fit Model with Negative Binomial Distribution

mod2 <- glm.nb(y ~ spray * lead, data = data) # glm.nb() is from the MASS package
Anova(mod2, type = 3, test.statistic = "F")
Sum Sq Df F values Pr(>F)
spray 98.112727 1 100.931739 0.0000000
lead 28.129281 1 28.937502 0.0000001
spray:lead 3.386873 1 3.484186 0.0621835
Residuals 1259.802870 1296 NA NA

1.4.6 Assumptions

Other methods to compare residuals for Poisson and negative binomial models include functions from the DHARMa package.

#install.packages("DHARMa")
library(DHARMa)
## This is DHARMa 0.4.7. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
simulateResiduals(mod1,plot = T)
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details

## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
##  
## Scaled residual values: 0.4574871 0.1249353 0.3765507 0.8898169 0.9947878 0.1384856 0.8427982 0.7401398 0.5451185 0.9247921 0.04867315 0.9158836 0.4210366 0.2450423 0.6110091 0.05695033 0.5344315 0.9367373 0.8666021 0.7861569 ...
simulateResiduals(mod2,plot = T)

## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
##  
## Scaled residual values: 0.5913531 0.3246744 0.5856685 0.8376787 0.991795 0.01855553 0.7269723 0.6891255 0.5103097 0.8653034 0.1887702 0.7998688 0.3835789 0.3235728 0.7524404 0.2983944 0.5828521 0.9131327 0.85387 0.8605788 ...
autoplot(mod2) 

There is no sign of a trend in the Residuals vs Fitted plot, and it is not unusual to see non-normally distributed residuals here for a negative binomial model.

The results from the model suggest that the interaction effect isn't significant, but we can confirm with multiple comparisons tests.

1.4.7 Multiple Comparisons

Familiar code to create the results for the multiple comparisons tests.

all_means <- mod2 %>%
  emmeans(~ spray:lead) %>%
  cld(Letters = letters, type = "response", details = F)
all_means
spray lead response SE df asymp.LCL asymp.UCL .group
4 Y Y 0.4123077 0.0391105 Inf 0.3423564 0.4965517 a
2 Y N 0.5046154 0.0440863 Inf 0.4252010 0.5988620 a
3 N Y 0.8523077 0.0611373 Inf 0.7405229 0.9809668 b
1 N N 1.4000000 0.0855387 Inf 1.2419967 1.5781041 c

1.4.8 Data Visualization

all_means <- all_means %>%
  as_tibble()

Plot_NB <- ggplot() +
  geom_point(data = data, aes(y = y,x = spray),position = position_jitter(width = 0.1)) + # dots representing the raw data
  facet_wrap(~ lead) +
  geom_boxplot(data = data, aes(y = y, x = spray), position = position_nudge(x = -0.25), width = 0.25, outlier.shape = NA) + # boxplot
  geom_point(data = all_means, aes(y = response, x = spray), position = position_nudge(x = 0.15), size = 2, color = "red") + # red dots representing the adjusted means
  geom_errorbar(data = all_means, aes(ymin = asymp.LCL, ymax = asymp.UCL, x = spray), position = position_nudge(x = 0.15), color = "red", width = 0.1) + # red error bars representing the confidence limits of the adjusted means
  geom_text(data = all_means, aes(y = response, x = spray, label = str_trim(.group)), position = position_nudge(x = 0.25), color = "black", angle = 0) + # group letters 
  labs(y = "Counts", x = "Spray Treatment") +
  theme_classic()
Plot_NB

While the ANOVA test showed a marginally significant interaction effect (p = 0.062), the post-hoc tests showed clear significant differences among the interaction levels. This can happen when there is a lot of variation within groups: the overall ANOVA test considers all of the variance in the model at once, so it may not show significant differences, while the post-hoc comparisons only examine pairwise contrasts and therefore a fraction of the overall variance.

The spray treatment lowered insect counts, and the lead treatment also lowered counts in the absence of spray. There was no difference between the counts in the spray treatments with or without the lead treatment.

1.4.9 Note about Zero-Inflated Models

There are instances where count data have excessive zeros, and the Poisson or negative binomial models will then produce biased results. If you believe that your data might have an excess number of zeros, I would suggest you look into zero-inflated or hurdle models. These models require careful consideration of the context of the "zero" observations. We won't be covering this topic further, but if you have questions about it please feel free to email us.
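For anyone who wants a starting point, one common option is the zeroinfl() function from the pscl package (named here as an assumption - glmmTMB is another popular choice). A minimal sketch, assuming the overdispersion data from above:

```r
# install.packages("pscl")
library(pscl)

# Count model (negative binomial) on the left of '|';
# zero-inflation (logit) model on the right ('1' = intercept-only)
zi_mod <- zeroinfl(y ~ spray * lead | 1, data = data, dist = "negbin")
summary(zi_mod)
```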

1.5 Binomial Distribution

Data that follow a binomial error distribution often have a binary (dichotomous) response. The example dataset comes from a study of 800 observations across two pepper fields documenting the presence or absence of Phytophthora disease. The response variable is the presence or absence of the disease, and the independent variables we will review are field and soil moisture percent, to determine whether they affect the probability of observing the disease.

data <- read_csv(here("data","Logistic_Regressionl.csv"))
## Rows: 800 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): field, disease
## dbl (4): row, quadrat, water, leaf
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(data)
field row quadrat disease water leaf
F1 1 1 N 15.05 5
F1 1 2 N 14.32 2
F1 1 3 N 13.99 3
F1 1 4 N 13.85 0
F1 1 5 N 13.16 5
F1 1 6 N 11.81 5
data$field <- as.factor(data$field)

With logistic regression, it can be more difficult to interpret the coefficients, so a visual representation of the trends can be easier to understand. In order to do that we need to create a new variable that is 0 or 1 instead of “Y” or “N”.

data$disease_num <- if_else(data$disease == "Y", 1, 0)

1.5.1 Data Visualization

We can look at the presence of the disease across both fields.

ggplot(data, aes(x = field, y = disease)) +
  geom_jitter(height = 0.2, width = 0.2)

1.5.2 Fitting the Model and Model Assumptions

mod1 <- glm(disease_num ~ field, data = data, family = binomial(link = "logit"))
car::Anova(mod1, type = 2, test.statistic = "F") # No interaction so type two is OK
Sum Sq Df F value Pr(>F)
field 0.4978846 1 0.4966398 0.4811859
Residuals 800.0000000 798 NA NA
summary(mod1)
## 
## Call:
## glm(formula = disease_num ~ field, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
##             Estimate Std. Error z value            Pr(>|z|)    
## (Intercept)  -1.8575     0.1463 -12.695 <0.0000000000000002 ***
## fieldF2       0.1423     0.2019   0.705               0.481    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 658.74  on 799  degrees of freedom
## Residual deviance: 658.24  on 798  degrees of freedom
## AIC: 662.24
## 
## Number of Fisher Scoring iterations: 4
plot(simulateResiduals(mod1))

autoplot(mod1) # Not too informative for this type of model - just make sure that the type of data matches the binomial regression (e.g., presence/absence).

# Note: watch for complete separation, where one level of a variable has only 1's or only 0's.

1.5.3 Look at Estimated Marginal Means

means <- mod1 %>%
  emmeans(pairwise ~ field, type = "response") %>% # the response scale will be the probability of disease presence
    cld(Letters = letters)
means
field prob SE df asymp.LCL asymp.UCL .group
F1 0.1350 0.0170862 Inf 0.1048716 0.1721197 a
F2 0.1525 0.0179752 Inf 0.1204985 0.1911532 a

The non-significant result suggests that the presence of the disease is not different between the two fields.

1.5.4 Visualize the Results

library(ggpp)
ggplot(data, aes(x = field, y = disease_num)) +
  geom_jitter(height = 0.01) +
  geom_errorbar(data = means,
                aes(x = field, y = prob, ymin = (prob - SE), ymax = (prob + SE)), 
                width = 0.1, lwd = 1, position = position_dodge(width = 0.5)) +
  geom_point(data = means, aes(x = field, y = prob), 
             size = 2, position = position_dodge(width = 0.5)) +
    geom_text(data = means, aes(y = prob, x = field, label = str_trim(.group)), color = "black", position = position_dodgenudge(width = 1.2, x = -0.15), hjust = 0, angle = 0) + 
  theme_classic()
## Warning: `position_dodge()` requires non-overlapping x intervals.

There are no differences in disease prevalence between the fields, but there are other variables to consider in this dataset. The other variable of interest is soil moisture percent, a potential covariate.

ggplot(data, aes(x = water, y = disease, color = field)) +
  geom_jitter(height = 0.2)
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_point()`).

1.5.5 Fitting the Model and Model Assumptions

mod1 <- glm(disease_num ~ field * water, data = data, family = binomial(link = "logit"))
car::Anova(mod1, type = 3, test.statistic = "F") 
Sum Sq Df F values Pr(>F)
field 26.671274 1 25.6585490 0.0000005
water 0.758459 1 0.7296598 0.3932534
field:water 36.854331 1 35.4549490 0.0000000
Residuals 818.062332 787 NA NA
summary(mod1)
## 
## Call:
## glm(formula = disease_num ~ field * water, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
##               Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)   -2.61224    0.82857  -3.153     0.00162 ** 
## fieldF2       -5.82953    1.18699  -4.911 0.000000905 ***
## water          0.06673    0.07437   0.897     0.36953    
## fieldF2:water  0.61732    0.10802   5.715 0.000000011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 648.80  on 790  degrees of freedom
## Residual deviance: 534.19  on 787  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: 542.19
## 
## Number of Fisher Scoring iterations: 6
#Model Assumptions
plot(simulateResiduals(mod1))

1.5.6 Look at Estimated Marginal Means

trends <- mod1 %>%
  emtrends(pairwise ~ field, var = "water") %>%
    cld(Letters = letters)
trends
field water.trend SE df asymp.LCL asymp.UCL .group
F1 0.0667339 0.0743679 Inf -0.0790246 0.2124924 a
F2 0.6840556 0.0783472 Inf 0.5304980 0.8376132 b
# emtrends function used to compare the slopes of the lines between the two fields.

The significant result suggests that the slope of the relationship differs between the two fields. With logistic regression, it can be more difficult to interpret the coefficients, so a visual representation of the trends can be easier to understand.

1.5.7 Visualize the Results

ggplot(data, aes(x = water, y = disease_num, color = field)) +
  geom_jitter(height = 0.01) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              formula = y ~ x, se = TRUE, lwd = 1.5) +
  labs(y = "Probability of Disease", x = "Soil Moisture %") +
  theme_classic()
## Warning: Removed 9 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 9 rows containing missing values or values outside the scale range
## (`geom_point()`).

For field 1, the probability of observing the disease increases gradually as soil moisture % increases. For field 2, the increase is much steeper: the probability climbs to nearly 100% between 10% and 15% soil moisture.
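A quick back-of-the-envelope check of that description (assumed arithmetic, not part of the original code): for field 2 the fitted probability crosses 50% where the linear predictor equals zero, using the coefficients from summary(mod1) above.

```r
# Field F2 intercept and slope, assembled from the summary(mod1) output
b0 <- -2.61224 + (-5.82953)  # F2 intercept
b1 <-  0.06673 + 0.61732     # F2 slope for water

-b0 / b1  # ~12.3% soil moisture at 50% disease probability
```

The crossing point near 12% falls inside the 10-15% range described above.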

1.5.8 Note about Ordered Logistic Regression

Logistic regression can be further complicated when the response variable is coded as either multinomial (many categories) or ordinal (discrete categories along a continuum). You would use the polr() function from the MASS package to estimate the ordered logistic regression model. The function name comes from the proportional odds logistic regression, which hints at the proportional odds assumption (a.k.a. parallel regression assumption) of the model. We won’t be covering this topic further, but if you have questions about it please feel free to email us.
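A hypothetical sketch of an ordered logistic regression with MASS::polr(); the data frame and variable names (severity, water) are made up for illustration and are not part of the lecture's datasets.

```r
library(MASS)

# Simulate an ordinal response with three ordered categories driven by
# a continuous predictor (made-up variables for illustration only)
set.seed(42)
toy <- data.frame(water = runif(120, 5, 20))
toy$severity <- cut(toy$water + rnorm(120, sd = 3),
                    breaks = c(-Inf, 10, 15, Inf),
                    labels = c("none", "mild", "severe"),
                    ordered_result = TRUE)

# Hess = TRUE stores the Hessian so summary() can report standard errors
mod_ord <- polr(severity ~ water, data = toy, Hess = TRUE)
summary(mod_ord)  # slope and the two cutpoints, all on the log-odds scale
```

Note that polr() assumes proportional odds: the water slope is the same across every cutpoint between adjacent categories, which is an assumption worth checking before interpreting the model.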

1.6 Practice Example

We are going to revisit the mead.lamb data (Mead et al. 2002) for practice. We can load the data from the agridat package.

library(agridat)
data(mead.lamb)
dat <- mead.lamb

# farm: three farm locations (F1, F2, F3)
# breed: three breeds of lamb (A, B, C)
# lambclass: four classes (number of live lambs per birth; L0-L3)
# y: count of ewes in each class

# Model specification steps

mod1 <- glm(y ~ farm * breed * lambclass, data = dat, family = poisson)
mod2 <- glm(y ~ farm * breed + breed * lambclass + farm * lambclass, data = dat, family = poisson)
mod3 <- glm(y ~ farm * breed + breed * lambclass, data = dat, family = poisson)
mod4 <- glm(y ~ farm * breed + farm * lambclass, data = dat, family = poisson)
mod5 <- glm(y ~ farm * breed + lambclass, data = dat, family = poisson)
mod6 <- glm(y ~ farm + breed + lambclass, data = dat, family = poisson)
mod7 <- glm(y ~ farm * breed, data = dat, family = poisson)

AIC(mod1, mod2, mod3, mod4, mod5, mod6, mod7)
     df      AIC
mod1 36 223.3074
mod2 24 213.8873
mod3 18 303.5781
mod4 18 206.1520
mod5 12 306.5457
mod6  8 373.5815
mod7  9 852.9800
# Model 4 is the best model given its lowest AIC value
# Should also use theory to drive what terms should be in the model and what hypotheses you want to test

# Check for overdispersion and look at the residual plots
# Then complete the multiple comparisons tests - show all pairwise comparisons for both interactions (e.g., farm:lambclass, and farm:breed)
# Finally make a pair of figures that shows the results and the interpretation of the statistical analyses

check_overdispersion(mod4)
## # Overdispersion test
## 
##        dispersion ratio =  1.074
##   Pearson's Chi-Squared = 19.338
##                 p-value =  0.371
## No overdispersion detected.
simulateResiduals(mod4, plot = TRUE)

## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
##  
## Scaled residual values: 0.4060429 0.7288599 0.3218639 0.6627756 0.6786652 0.6179234 0.3357164 0.7419022 0.6296398 0.2100956 0.8245195 0.08288212 0.8735348 0.756626 0.2582193 0.4837854 0.4102228 0.4112786 0.6805604 0.2090176 ...
mod4.nb <- glm.nb(y ~ farm * breed + farm * lambclass, data = dat)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## (the warning above is repeated 26 times in total)
## Warning in glm.nb(y ~ farm * breed + farm * lambclass, data = dat): alternation
## limit reached
anova(mod4, mod4.nb)
Resid. Df Resid. Dev Df  Deviance Pr(>Chi)
       18    18.8446 NA        NA       NA
       18    18.8427  0 0.0019007       NA
AIC(mod4, mod4.nb)
        df      AIC
mod4    18 206.1520
mod4.nb 19 208.1576

The negative binomial fit barely changes the residual deviance and has a higher AIC, so we retain the Poisson model (mod4).

all_means_FB <- mod4 %>%
  emmeans(~ farm:breed) %>%
  cld(Letters = letters, type = "response")
all_means_FB
  farm breed      rate        SE  df asymp.LCL asymp.UCL .group
8   F2     C  3.583796 0.8192890 Inf  2.289553  5.609651  a
9   F3     C  4.590603 0.9113328 Inf  3.110914  6.774097  ab
4   F1     B  7.818046 1.2371518 Inf  5.733258 10.660928  abc
2   F2     A  9.215476 1.6077453 Inf  6.546601 12.972378   bcd
5   F2     B 10.111425 1.7280226 Inf  7.233420 14.134519   bcd
7   F1     C 13.256687 1.6832473 Inf 10.336060 17.002587    cd
6   F3     B 17.417288 2.3205498 Inf 13.414447 22.614567     de
1   F1     A 25.493630 2.5453331 Inf 20.962644 31.003968      ef
3   F3     A 30.243972 3.6372401 Inf 23.892991 38.283104       f
all_means_FL <- mod4 %>%
  emmeans(~ farm:lambclass) %>%
  cld(Letters = letters, type = "response")
all_means_FL
   farm lambclass      rate        SE  df  asymp.LCL asymp.UCL .group
11   F2        L3  1.211285 0.6072527 Inf  0.4534336  3.235777  a
12   F3        L3  1.541272 0.6330599 Inf  0.6890611  3.447471  a
2    F2        L0  4.239496 1.1435635 Inf  2.4986805  7.193127  ab
1    F1        L0  5.937368 1.3399401 Inf  3.8150091  9.240435  abc
4    F1        L1 10.093526 1.7582109 Inf  7.1741503 14.200883  bc
10   F1        L3 11.281000 1.8621120 Inf  8.1628719 15.590219  bc
3    F3        L0 11.302658 1.7788181 Inf  8.3026754 15.386616  bc
5    F2        L1 12.415666 1.9912184 Inf  9.0668149 17.001425   c
8    F2        L2 36.338536 3.5724079 Inf 29.9699889 44.060382    d
6    F3        L1 40.073061 3.6840219 Inf 33.4656540 47.985024    d
9    F3        L2 46.495026 4.0443228 Inf 39.2071782 55.137541    d
7    F1        L2 54.030052 4.3310308 Inf 46.1746233 63.221882    d